Bulletin of Electrical Engineering and Informatics 
Vol. 13, No. 3, June 2024, pp. 1961~1969 
ISSN: 2302-9285, DOI: 10.11591/eei.v13i3.5708


Refining disparity maps using deep learning and edge-aware 
smoothing filter 


Shamsul Fakhar Abd Gani1,2, Muhammad Fahmi Miskon1, Rostam Affendi Hamzah2, Mohd Saad
Hamid2, Ahmad Fauzan Kadmin2, Adi Irwan Herman3
1Fakulti Kejuruteraan Elektrik, Universiti Teknikal Malaysia Melaka, Malaysia
2Fakulti Teknologi Kejuruteraan Elektrik dan Elektronik, Universiti Teknikal Malaysia Melaka, Malaysia
3Product and Test Engineering, Texas Instruments (Malaysia), Melaka, Malaysia 


ABSTRACT

of resilience against radiometric variation and edge inconsistencies. In this article, a
convolutional neural network (CNN) is employed in the first stage to generate the raw matching
cost, which is subsequently filtered with a bilateral filter (BF) and processed with cross-based
cost aggregation (CBCA) during the cost aggregation stage to enhance precision. A winner-take-all
(WTA) strategy is implemented to normalise the disparity map values. Finally, the resulting output
is subjected to an edge-aware smoothing filter (EASF) to reduce the noise. Due to its resistance to
high contrast and brightness, the filter is found to be effective in refining and eliminating noise
from the output image. Despite discontinuities such as the lost cup handle in Adiron or the broken
rods in ArtL, this approach, evaluated experimentally on the standard Middlebury benchmark, yields
a high level of accuracy, with an average non-occluded error of 6.79%, comparable to other
published methods.

Keywords: Disparity map, Edge-aware filter, Stereo matching

1. INTRODUCTION

Light detection and ranging (LIDAR) is a remote sensing approach that employs a pulsed laser to 
detect distances and build three-dimensional (3D) discrete surfaces of the real world. Stereo vision addresses 
LIDAR's hardware, interface, and power issues using left and right cameras to take images of the same target 
from various angles and match pixels to calculate depth. Stereo vision estimates depth via triangulation. By 
knowing the distance between the cameras and the angle between their optical axes, the depth of any point in 
the scene that matches in both images can be found by using trigonometry as shown in (1): 


$$z = \frac{f\,b}{d} \tag{1}$$

where z is the depth, f is the camera focal length, b is the baseline (the distance between the
cameras' optical centres), and d is the disparity. Stereo vision has many applications in fields
such as robotics, 3D object/face recognition [1]-[7], and virtual reality. Stereo vision struggles
with occlusion, illumination, noise, calibration errors, and computing complexity. Many methods and
approaches have been developed to increase stereo vision system accuracy and efficiency. Scharstein
et al. [8] proposed a common outline, implemented as the sequential multi-stage system shown in
Figure 1. Rectified image pairs are fed to the framework. First, a cost function is calculated to
quantify patch similarity in the input image pair. The second step aggregates costs and filters
noise [9]-[11]. Then, a winner-take-all (WTA) strategy is employed to select the disparity with the
lowest cost while discarding the others. Finally, disparity refinement is achieved by applying an
optical low-pass filter, among other refinement methods.
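
As a quick illustration of the triangulation relation in (1), the following minimal sketch converts a disparity into a depth. The function name and the rig values (a 1000-pixel focal length and a 0.10 m baseline) are hypothetical and are not taken from the experiment.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth z = f * b / d for a rectified stereo pair.

    disparity_px and focal_px are in pixels, baseline_m in metres,
    so the returned depth is in metres.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: f = 1000 px, b = 0.10 m, measured disparity of 50 px.
z = depth_from_disparity(disparity_px=50.0, focal_px=1000.0, baseline_m=0.10)
print(f"depth: {z:.2f} m")
```

Because depth is inversely proportional to disparity, halving the disparity doubles the estimated depth, which is why small disparity errors matter most for distant objects.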


[Figure 1: Step 1, matching cost computation; Step 2, cost aggregation; Step 3, disparity
computation; Step 4, disparity refinement; output: disparity image]

Figure 1. The stages of the stereo vision framework [8]


Local approaches estimate disparity from the relationship between pixel intensities (grayscale,
RGB colours, texture patterns) inside a local support window, based on the connections between
pixels in close proximity to one another in the corresponding image [12]. These methods are
sometimes referred to as window-based or region-based approaches. Established local techniques
include sum of absolute differences (SAD) [13], sum of squared differences (SSD) [14], and
normalised cross-correlation (NCC) [15]. Local stereo matching techniques estimate the disparity by
comparing the surroundings of a pixel p in the left image to the surroundings of a pixel q in the
right image, where q has been translated by a candidate disparity d. Local methods run quickly, but
quality is compromised, particularly in the region of depth discontinuities.
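
The window comparison described above can be sketched with a brute-force SAD matcher. This is an illustrative baseline only, not the method proposed in this article; the function name, the window radius, and the search range are assumptions.

```python
import numpy as np


def sad_disparity(left, right, max_disp, radius=2):
    """Brute-force SAD block matching on rectified grayscale images."""
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int64)
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            patch_l = left[y - radius:y + radius + 1, x - radius:x + radius + 1]
            best_cost, best_d = np.inf, 0
            # Candidate q = (x - d, y); its window must stay inside the image.
            for d in range(0, min(max_disp, x - radius) + 1):
                patch_r = right[y - radius:y + radius + 1,
                                x - d - radius:x - d + radius + 1]
                cost = np.abs(patch_l - patch_r).sum()
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp
```

Each pixel is decided independently from its own window, which is what makes such methods fast and also what makes them unreliable near depth discontinuities, where the window straddles two surfaces.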

Global optimization methods, on the other hand, treat the disparity problem as the minimisation
of a predetermined global energy function [16]. They typically require additional resources and are
less susceptible to local variance. The evaluation is computed using global data and a smoothness
constraint on neighbouring pixels. Markov random field (MRF) formulations have produced a variety
of solutions to the global energy minimisation problem. These approaches can be classified as
either graph cut (GC) or belief propagation (BP) methods. GC obtains the lowest-energy solution by
computing the minimum cut and maximum flow of the MRF graph. In contrast, the BP technique reduces
the energy function by repeatedly passing messages from the current node to neighbouring nodes
within the MRF network [17]-[20].
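
For orientation, a commonly used form of such a global energy is sketched below. The notation is an assumption for illustration and is not taken from [16]: $C(p, d_p)$ is the matching cost of assigning disparity $d_p$ to pixel p, $V$ penalises disparity differences between neighbouring pixels in the neighbourhood set $\mathcal{N}$, and $\lambda$ weights the smoothness term.

```latex
E(D) = \sum_{p} C(p, d_p) + \lambda \sum_{(p,q) \in \mathcal{N}} V(d_p, d_q)
```

GC and BP differ in how they search for the disparity assignment D that minimises an energy of this kind, not in the energy itself.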

Since the introduction in [21] of a convolutional neural network (CNN) trained on pairs of small
image patches with known true disparity, interest in deep learning-based stereo vision systems has
increased substantially. Some post-processing steps are still required for these approaches.
Several deep learning-based stereo matching algorithms, such as DispNets [22], GCNet [23], PSMNet
[24], GwcNet [25], EdgeStereo [26], LEAStereo [27], HITNet [28], and EAI-Stereo [29], minimise
post-processing stages and improve stereo matching performance.


2. METHOD 

Figure 2 is a block diagram of the proposed method, with the four boxes representing the four
stages adapted from Figure 1. An edge-aware smoothing filter (EASF) [30] is applied to remove noise
created by the fill-in process.

- Matching cost computation: the stereo matching method starts by extracting the height, width, and
  number of disparities from the calibration file. The left and right images are then loaded using
  the OpenCV library, and pre-processing techniques such as scaling, grayscale conversion, and
  normalisation are applied. The images are then processed by a trained CNN model implemented with
  TensorFlow to obtain the cost volume, a 3D array containing the matching cost for each pixel and
  disparity level.
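
The pre-processing step can be sketched as follows. This is a dependency-light NumPy illustration rather than the OpenCV calls used in the experiment; the function name is hypothetical, and zero-mean, unit-variance normalisation is an assumed choice, since the text does not specify which normalisation is applied.

```python
import numpy as np


def preprocess(image_rgb):
    """Convert an H x W x 3 RGB array to a normalised grayscale float image."""
    img = np.asarray(image_rgb, dtype=np.float64)
    # Luma weights (ITU-R BT.601) for the R, G, and B channels.
    gray = img @ np.array([0.299, 0.587, 0.114])
    std = gray.std()
    # Assumed normalisation: zero mean and unit variance per image.
    return (gray - gray.mean()) / (std if std > 0 else 1.0)
```

Normalising each image separately removes global brightness and contrast differences between the two views before the patches reach the network.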

As shown in Figure 3, a series of square-shaped kernels make up the convolution layer, and despite
their small size, these filters extend through the whole depth of the input volume. Each kernel
reads an input region, computes the dot product, and stores the result in the activation layer. The
feature maps are then stacked to form the input volume for the succeeding network layer. The
computing cost of adding more rectified linear unit (ReLU) layers to a CNN grows linearly [31],
[32]. The pooling layer gradually reduces the input dimension, reducing system parameters and
processing, and helping to prevent overfitting.
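
The three layer types described above can be sketched in plain NumPy. This is a didactic illustration with hypothetical function names and shapes, not the TensorFlow model used in the experiment.

```python
import numpy as np


def conv2d_valid(volume, kernels):
    """volume: (H, W, C); kernels: (K, k, k, C) -> feature maps (H-k+1, W-k+1, K)."""
    h, w, _ = volume.shape
    n_k, k, _, _ = kernels.shape
    out = np.empty((h - k + 1, w - k + 1, n_k))
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            region = volume[y:y + k, x:x + k, :]
            # One dot product per kernel, taken over the full depth of the region.
            out[y, x, :] = np.tensordot(kernels, region, axes=([1, 2, 3], [0, 1, 2]))
    return out


def relu(x):
    """Rectified linear unit: negative activations are set to zero."""
    return np.maximum(x, 0.0)


def max_pool_2x2(x):
    """Non-overlapping 2 x 2 max pooling; odd trailing rows/columns are dropped."""
    h, w, c = x.shape
    x = x[:h - h % 2, :w - w % 2, :]
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))
```

Stacking the K per-kernel outputs along the last axis is what produces the input volume for the next layer, and pooling halves each spatial dimension.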


[Figure 2: block diagram of the proposed method; inputs: left image and right image; refinement
block: Interpolate + LR Check + Fill-in + EASF; output: disparity image]

Figure 2. The proposed method


[Figure 3: the left patch and the right patch each pass through convolution layers, followed by
fully-connected layers]

Figure 3. The structure of MC-CNN [13]


The matching cost can therefore be computed directly from the CNN output. Based on [25], the cost
$C_{CNN}$ is given by (2):


$$C_{CNN}(p, d) = -s\left(\left\langle P^{L}(p),\, P^{R}(p - d) \right\rangle\right) \tag{2}$$


where $P^L$ and $P^R$ are the left (L) and right (R) patches, p is the (x, y) position, d is the
disparity, and s(·) is the similarity score produced by the network. The outputs of layers L1, L2,
and L3 need to be computed only once per location p and need not be recomputed for every disparity
d. Using a patch-based technique, features are computed for both the left and right images, taking
the images together with the patch height and patch width as inputs. The cost volume is then
computed from the left and right feature matrices and the expected maximum disparity (ndisp) by
iterating over all candidate disparities, which range from 0 to ndisp-1. For each disparity, the
function calculates the dot product of the features at each pixel in the left image and the
corresponding pixel in the right image shifted by the disparity.
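
The loop over disparities described above can be sketched as follows. The sketch assumes that the per-pixel features are already available as unit-length vectors, so the dot product plays the role of the similarity score s; the function name and the fill value for positions without a valid match are assumptions.

```python
import numpy as np


def cost_volume(feat_left, feat_right, ndisp):
    """feat_*: (H, W, F) unit-normalised features -> cost volume (ndisp, H, W)."""
    h, w, _ = feat_left.shape
    # Positions with no valid match keep a high cost; 1.0 is an assumed fill value.
    volume = np.full((ndisp, h, w), 1.0)
    for d in range(min(ndisp, w)):
        # Left pixel (x, y) is compared with right pixel (x - d, y).
        similarity = (feat_left[:, d:, :] * feat_right[:, :w - d, :]).sum(axis=2)
        # Cost is the negated similarity, as in (2).
        volume[d, :, d:] = -similarity
    return volume
```

Because the features are computed once per image and only the shifted dot product is repeated per disparity, the expensive network layers are not re-evaluated for every candidate d.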

- Cost aggregation: in the cost aggregation stage, cost filtering is first done using BF, which is
  useful for removing noise while preserving edges [33]. The BF takes the input image, the size of
  the filter kernel (the diameter of each pixel neighbourhood), and two parameters controlling the
  filter's range of influence: sigma color and sigma space. Sigma color controls how much the pixel
  values can differ while still being considered neighbours, and sigma space controls how far away
  the pixels can be while still being considered neighbours. The function loops over each disparity
  level d and applies the bilateral filter to the 2D slice of the cost volume at that disparity
  level. The filtered slice is then stored in the corresponding slice of a new 3D array with the
  same dimensions as the raw cost volume.
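
The arguments described above correspond to those of OpenCV's `cv2.bilateralFilter(src, d, sigmaColor, sigmaSpace)`. To keep the illustration dependency-free, the sketch below reimplements a brute-force bilateral filter in NumPy and applies it slice by slice; the function names and default parameter values are assumptions, not the settings used in the experiment.

```python
import numpy as np


def bilateral_filter(img, diameter, sigma_color, sigma_space):
    """Brute-force bilateral filter for a 2D float array (edge-replicated borders)."""
    img = np.asarray(img, dtype=np.float64)
    r = diameter // 2
    padded = np.pad(img, r, mode="edge")
    h, w = img.shape
    num = np.zeros((h, w))
    den = np.zeros((h, w))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = padded[r + dy:r + dy + h, r + dx:r + dx + w]
            # Spatial weight: falls off with distance from the centre pixel.
            w_space = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_space ** 2))
            # Range weight: falls off with difference in value.
            w_range = np.exp(-((shifted - img) ** 2) / (2.0 * sigma_color ** 2))
            weight = w_space * w_range
            num += weight * shifted
            den += weight
    return num / den


def filter_cost_volume(volume, diameter=5, sigma_color=0.1, sigma_space=2.0):
    """Apply the bilateral filter to every disparity slice of an (ndisp, H, W) volume."""
    filtered = np.empty_like(volume, dtype=np.float64)
    for d in range(volume.shape[0]):
        filtered[d] = bilateral_filter(volume[d], diameter, sigma_color, sigma_space)
    return filtered
```

With a small sigma color, neighbours on the other side of a cost discontinuity receive almost no weight, so noise is averaged out inside flat areas while the discontinuity itself is left in place.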

Cross-based cost aggregation (CBCA), shown in Figure 4 [34], is applied to the filtered cost
volume to further refine the disparity map by computing the cross region of each pixel. The cross
region is constructed from vertical and horizontal arms for efficiency, and the function returns a
NumPy array representing the cross union region and a NumPy array holding the number of elements at
each position of the cross union region. CBCA aggregates the matching costs of neighbouring pixels
and disparities to reduce errors caused by occlusions and noise. First, the disparity map is
initialised with the minimum-cost disparity for each pixel. Then, for each pixel, the algorithm
selects the disparity with the minimum cost and applies a weighted median filter to the
neighbouring pixels and disparities that have a similar cost. This process is repeated for several
iterations to refine the disparity map further.
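
The construction of a cross-shaped support region can be sketched as below. This is a simplified, unoptimised illustration of the idea in Figure 4 (vertical arm of p, then the horizontal arms of each pixel q on that arm); the function names, the intensity threshold `tau`, and the maximum arm length `max_len` are assumptions, and the sketch only averages costs over the region rather than reproducing the full procedure described above.

```python
import numpy as np


def arm_length(img, y, x, dy, dx, tau, max_len):
    """Number of steps from (y, x) in direction (dy, dx) while the intensity
    stays within tau of the anchor pixel."""
    h, w = img.shape
    length = 0
    for step in range(1, max_len + 1):
        yy, xx = y + step * dy, x + step * dx
        if not (0 <= yy < h and 0 <= xx < w):
            break
        if abs(float(img[yy, xx]) - float(img[y, x])) > tau:
            break
        length = step
    return length


def cross_support(img, y, x, tau=10.0, max_len=5):
    """Support region of p = (y, x): the horizontal arms of every q on p's vertical arm."""
    up = arm_length(img, y, x, -1, 0, tau, max_len)
    down = arm_length(img, y, x, 1, 0, tau, max_len)
    region = []
    for qy in range(y - up, y + down + 1):
        left = arm_length(img, qy, x, 0, -1, tau, max_len)
        right = arm_length(img, qy, x, 0, 1, tau, max_len)
        region.extend((qy, qx) for qx in range(x - left, x + right + 1))
    return region


def aggregate_pixel(cost_slice, img, y, x, tau=10.0, max_len=5):
    """Mean matching cost over the support region of (y, x) for one disparity slice."""
    region = cross_support(img, y, x, tau, max_len)
    return sum(cost_slice[qy, qx] for qy, qx in region) / len(region)
```

Because the arms stop at intensity edges, the region adapts its shape to the local image structure instead of using a fixed square window.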


[Figure 4: labelled elements are the up arm of p, bottom arm of p, left arm of p, right arm of p,
horizontal arms of q, and the support region of p]

Figure 4. Cross-based cost aggregation


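Equation (3) gives the BF function used in this paper:

$$BF[I]_p = \frac{1}{W_p} \sum_{q \in S} G_{\sigma_s}(\lVert p - q \rVert)\, G_{\sigma_r}(\lvert CNN_p - CNN_q \rvert)\, CNN_q \tag{3}$$

where $G_{\sigma_s}$ is the 'space' kernel, which determines the influence of distant pixels,
$G_{\sigma_r}$ is the 'range' kernel, which determines the influence of a pixel q whose value
differs from that of p, and $W_p$ is the normalisation factor (the sum of the weights).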

- Disparity computation: this stage selects, for each pixel, the disparity value that is assigned
  in the disparity map. The most common local technique, WTA optimization, is implemented, as shown
  in (4):


$$d_p = \arg\min_{d \in D} C(BF[I]_p, d) \tag{4}$$


where $d_p$ is the selected disparity with the lowest cost, $C(BF[I]_p, d)$ is the aggregated cost
produced by BF and CBCA in the second stage, and D is the set of all admissible disparities. The
index of the minimum matching cost is obtained using the np.argmin() function and assigned as the
label for the pixel in the respective label map. WTA labels each pixel with the disparity of lowest
matching cost without considering neighbouring pixels, which can produce noisy results in
textureless or occluded regions.
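
The WTA selection in (4) reduces to a single `np.argmin` call along the disparity axis. The sketch below assumes that the cost volume is stored with shape (ndisp, H, W); the function name is hypothetical.

```python
import numpy as np


def winner_take_all(cost_volume):
    """cost_volume: (ndisp, H, W) -> integer disparity map (H, W)."""
    # For every pixel, pick the disparity index with the lowest aggregated cost.
    return np.argmin(cost_volume, axis=0)


# Tiny example with two disparity levels on a 2 x 2 image.
volume = np.array([[[3, 1], [2, 5]],
                   [[1, 4], [2, 0]]])
print(winner_take_all(volume))
```

When two disparities tie, `np.argmin` returns the first (smallest) index, which is one reason WTA output can look noisy in textureless regions where many costs are nearly equal.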

- Disparity refinement: in the last stage of the framework, standard approaches for post-processing
  and disparity refinement were implemented. Peng et al. [35] integrated mean shift and super-pixel
  segmentation (SEG), which clusters the disparity map according to colour. A weighted median
  filter (WMF) is implemented; it essentially combines box aggregation with a weighted median,
  eliminating outlier noise while conserving the edges [36]. A median filter (MF) was utilised in
  [35]; MF is accurate at the edges but inaccurate in low-textured regions. BF has been used to
  enhance the edge quality, although its computation takes longer than that of other filters [37].

  Disparity maps may have missing pixels due to occlusions, depth discontinuities, or noise.
  Interpolation fills these gaps using a left-right consistency check and median filtering.
  Mismatches are interpolated with the median value of the nearest matching neighbours, while
  occlusions take the disparity value of the nearest matching neighbour on the right (or left), or
  the raw disparity value if no match is found.

  The quality of the disparity map is enhanced further by applying EASF, a modification of the
  traditional edge-preserving filter [38], [39] that applies the filter iteratively, smoothing the
  image while preserving its edges [40]. The idea is that the amount of smoothing between two
  pixels should depend on how far apart they are: the further apart, the less smoothing. This can
  be expressed mathematically using a distance metric in a 5D space (for a colour image, two
  spatial and three colour coordinates). If a transformation is applied that preserves these
  distances, the edge-preserving property of the filter is maintained even when the image is
  represented in a lower-dimensional space. The recursive nature of this filter lets it refine its
  output over multiple iterations, gradually removing noise while preserving the important features
  of the image.
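
The distance-preserving, recursive idea described above can be sketched in one dimension. The code below is written in the spirit of a domain-transform recursive filter and is an assumption about how such a filter behaves; it is not the EASF implementation of [30]. The function name, the parameters `sigma_s` and `sigma_r`, and the use of a separate guide signal are all hypothetical.

```python
import numpy as np


def edge_aware_smooth_1d(signal, guide, sigma_s=10.0, sigma_r=1.0, iterations=3):
    """Recursive edge-aware smoothing of a 1D signal, steered by a guide signal."""
    out = np.asarray(signal, dtype=np.float64).copy()
    guide = np.asarray(guide, dtype=np.float64)
    # Distance between neighbouring samples in the transformed domain:
    # one unit of space plus a scaled jump in the guide value.
    dt = 1.0 + (sigma_s / sigma_r) * np.abs(np.diff(guide))
    for i in range(iterations):
        # The filter scale shrinks with each iteration (coarse to fine).
        sigma_h = (sigma_s * np.sqrt(3.0) * 2.0 ** (iterations - i - 1)
                   / np.sqrt(4.0 ** iterations - 1.0))
        a = np.exp(-np.sqrt(2.0) / sigma_h)
        w = a ** dt  # w[k] couples samples k and k + 1; near zero across edges
        for k in range(1, out.size):            # left-to-right pass
            out[k] += w[k - 1] * (out[k - 1] - out[k])
        for k in range(out.size - 2, -1, -1):   # right-to-left pass
            out[k] += w[k] * (out[k + 1] - out[k])
    return out
```

A large jump in the guide makes the transformed distance between two neighbours large, so the coupling weight collapses and no smoothing flows across the edge, while samples within a flat region are averaged together over the successive passes.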

3. RESULTS AND DISCUSSION

The experiment was run on a Windows 10 computer with an Intel Xeon 2.80 GHz CPU, an Nvidia Quadro
P1000 GPU, and 8 GB of DDR4 RAM. The project is written in Python, using libraries such as
TensorFlow, PyTorch, and Keras, which make it easy to construct and train deep learning models. The
Middlebury v3 benchmarking system is used to evaluate the error percentage and determine the
accuracy. The deep learning component keeps the same default parameters as the experiment in [21].
Table 1 compares the Shelvs, Jadepl, and Recyc images before and after the introduction of EASF.
EASF generated disparity maps with a lower mean error, and the images show improved edge
preservation with less salt-and-pepper noise, particularly in textureless regions. The chance of a
mismatch is considerable in these regions because many different locations have identical pixel
values.


Table 1. Comparison of images before and after the introduction of EASF
Image     Error (%), ground truth   Error (%), no EASF   Error (%), with EASF
Shelvs    0.0                       15.0                 13.5
Jadepl    0.0                       22.8                 21.8
Recyc     0.0                       2.92                 2.82


The quantitative evaluation results for non-occluded error (NON-OCC) are displayed in Table 2 and
for all-pixel error (ALL) in Table 3. Images such as the Adirondack chair and the motorcycle are
captured and reconstructed according to the disparity levels. In Table 2, the proposed method
achieves its lowest NON-OCC errors on the Recyc and Teddy images, at 2.82% and 2.92%, respectively,
followed by Motor at 3.94% and MotorE at 3.98%. Some complex images, such as Shelvs and Jadepl, had
the highest errors at 13.5% and 21.8%, respectively, because the system cannot reliably match
repeated patterns and ambiguous areas, such as the textureless regions in the Shelvs image or the
small repetitive leaf clusters in the Jadepl image. The method outlined in this article locates the
disparity accurately, although in certain cases it reconstructs a disparity map with noticeable
discontinuities, such as the missing mug handle in Adiron or the fractured rods in ArtL.
Nevertheless, the disparity level is assigned precisely where the distance contours are constant
and correct matching is possible. The proposed technique reduces salt-and-pepper noise while
preserving object boundaries. Tables 2 and 3 compare the performance of the proposed work with a
number of established and published methods. Table 4 compares the Middlebury ground truth with the
output disparity map of the proposed system, along with the signed error for qualitative
observation. The Middlebury evaluation shows that the proposed stereo correspondence approach
produces accurate results, with an average non-occluded error of 6.79%. This demonstrates that the
proposed method is comparable with recently published techniques and can be implemented as a
complete algorithm.
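
For readers who want to reproduce an error percentage of this kind offline, a bad-pixel metric can be sketched as follows. The 2.0-pixel threshold, the function name, and the mask convention are assumptions for illustration; the article does not state the threshold, and the official Middlebury evaluation should be used for reported numbers.

```python
import numpy as np


def bad_pixel_error(disp, ground_truth, valid_mask, threshold=2.0):
    """Percentage of valid pixels whose absolute disparity error exceeds threshold.

    valid_mask selects the pixels to score, for example non-occluded pixels
    for a NON-OCC style error or every pixel with ground truth for ALL.
    """
    err = np.abs(np.asarray(disp, dtype=float) - np.asarray(ground_truth, dtype=float))
    valid = np.asarray(valid_mask, dtype=bool)
    if not valid.any():
        raise ValueError("no valid pixels to evaluate")
    return 100.0 * np.count_nonzero(err[valid] > threshold) / np.count_nonzero(valid)
```

The only difference between the NON-OCC and ALL columns in such a computation is the mask: occluded pixels are excluded in the former and included in the latter, which is why the ALL errors are consistently higher.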


Table 2. Comparative results for NON-OCC error (%) using Middlebury v3 evaluation platform

Method            AvgErr  Adiron  ArtL    Jadepl  Motor   Piano   Pipes   Playrm  Playt   Recyc   Shelvs  Teddy   Vintge
MANet [41]        2.77    1.29    1.44    14.10   1.61    1.87    2.68    2.88    1.42    1.12    2.87    0.94    3.94
SGMEPi [42]       4.57    1.72    3.36    9.72    1.79    3.20    3.66    3.48    2.78    2.09    8.04    1.75    26.40
Proposed          6.79    4.36    7.08    21.80   3.94    4.84    7.52    4.85    6.22    2.82    13.50   2.92    8.46
PSMNet_ROB [24]   9.60    7.32    9.69    44.50   5.55    5.01    9.86    7.33    4.40    3.73    11.10   3.44    8.07
Z2ZNCC [43]       10.10   3.43    8.19    30.80   4.36    4.55    8.42    7.78    10.30   3.85    12.90   3.84    47.80

Table 3. Comparative results for ALL error (%) using Middlebury v3 evaluation platform

Method            AvgErr  Adiron  ArtL    Jadepl  Motor   Piano   Pipes   Playrm  Playt   Recyc   Shelvs  Teddy   Vintge
MANet [41]        3.30    1.48    1.75    14.90   2.11    2.13    4.95    3.83    1.67    1.28    2.99    1.23    4.69
Proposed          12.90   8.68    13.10   52.70   8.80    6.32    16.00   8.79    11.10   5.79    14.70   4.83    11.70
PSMNet_ROB [24]   13.30   8.83    13.90   68.40   8.26    5.89    14.40   9.38    5.54    4.98    11.60   3.87    9.66
SGMEPi [42]       13.40   5.65    18.20   30.80   9.18    8.49    15.80   21.00   10.70   5.80    11.00   10.70   31.90
Z2ZNCC [43]       18.00   7.65    22.30   49.30   11.70   7.96    20.10   22.80   17.70   8.04    15.40   11.60   49.80


Table 4. Ground truth of Middlebury v3 dataset compared to proposed algorithm output and error
[Image panels: for each Middlebury image (Adiron, ArtL, Jadepl, Motor, MotorE, Piano, PianoL, and
the remaining images), the rows show the ground truth, the proposed output, and the signed error]


4. CONCLUSION 

This article presents a technique for stereo matching that combines a convolutional neural network
(CNN) with an edge-aware smoothing filter (EASF). The CNN learns features and cost functions from
the dataset to produce initial disparity maps. The smoothing filter then preserves depth
discontinuities and smooths homogeneous zones on these maps. Integrating the benefits of both
approaches improves stereo matching in ill-posed regions. Experimental results demonstrate the
effectiveness of the proposed framework: when evaluated using the Middlebury benchmark, the system
achieves an average NON-OCC error of 6.79%. EASF belongs to a class of nonlinear filters that
smooth an image while reducing edge-blurring artefacts such as halos and phantom edges. Because
these filters preserve edge information while blurring an image, they are useful for improving
depth estimation accuracy in stereo vision research, addressing challenges such as lighting, image
complexity, and pixel noise. The proposed technique shows promising results and can be explored
further to advance the field of stereo vision research. A more extensive discussion of the system's
limitations and the nature of these challenges is still needed: understanding the conditions under
which the method may fail or perform suboptimally is essential for potential users and researchers.